GENIA corpus - a semantically annotated corpus for bio-textmining

Authors

Jin-Dong Kim

Tomoko Ohta

Yuka Tateisi

Jun'ichi Tsujii

Abstract

MOTIVATION: Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this literature, however, causes a major bottleneck for applying NLP techniques. GENIA corpus is being developed to provide reference materials to let NLP techniques work for bio-textmining.

RESULTS: GENIA corpus version 3.0, consisting of 2000 MEDLINE abstracts, has been released with more than 400,000 words and almost 100,000 annotations for biological terms.

Similar resources

The GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain

With the information overload in the genome-related field, there is an increasing need for natural language processing technology to extract information from literature, and various attempts at information extraction using NLP have been made. We are developing the necessary resources including domain ontology and annotated corpus from research abstracts in MEDLINE database (GENIA corpus). We are ...

Steps towards a GENIA Dependency Treebank

In this paper we describe on-going work aimed at creating a dependency-based annotated treebank for the BioMedical domain. Our starting point is the GENIA corpus [14], which is a corpus of 2000 MEDLINE abstracts, which has been manually annotated for various biological entities, according to the GENIA Ontology. There is an exponential growth of published research in this sector, which makes it...

A Semantically Annotated Swedish Medical Corpus

With the information overload in the life sciences there is an increasing need for annotated corpora, particularly with biological and biomedical entities, which is the driving force for data-driven language processing applications and the empirical approach to language study. Inspired by the work in the GENIA Corpus, which is one of the very few of such corpora, extensively used in the biomedi...

GENIA-GR: a Grammatical Relation Corpus for Parser Evaluation in the Biomedical Domain

We report the construction of a corpus for parser evaluation in the biomedical domain. A 50-abstract subset (492 sentences) of the GENIA corpus (Kim et al., 2003) is annotated with labeled head-dependent relations using the grammatical relations (GR) evaluation scheme (Carroll et al., 1998), which has been used for parser evaluation in the newswire domain.

Encoding Biomedical Resources in TEI: The Case of the GENIA Corpus

It is well known that standardising the annotation of language resources significantly raises their potential, as it enables re-use and spurs the development of common technologies. Despite the fact that increasingly complex linguistic information is being added to biomedical texts, no standard solutions have so far been proposed for their encoding. This paper describes a standardised XML tagse...

Journal:

Bioinformatics

Volume 19, Suppl 1

Published 2003
